Assignment #3 - When Cats meet GANs

Adithya Praveen

I. Project Description

The goal of this project is to understand Generative Adversarial Networks (GANs) better by getting our hands dirty and implementing some popular GAN architectures like:

  • Deep Convolutional GAN (DCGAN): This is closer to the original GAN use-case of mapping noise to realistic images; DCGAN just uses a better (convolutional) model, and we use it here to generate realistic images of grumpy cats.
  • CycleGAN: This network helps us understand how to use GANs to translate images from one domain (for example, grumpy cats) into another domain (Russian Blue cats) and vice-versa - a cooler application of GANs, in my opinion.

    II. DCGAN

    Method

    Just like a typical GAN, the DCGAN architecture also has a generator and a discriminator. The main difference is DCGAN's use of convolutional layers combined with normalization and activation layers as shown in the diagrams below:

    Generator
    Discriminator

    Note: Real images have a label of 1, and fake images have a label of 0.
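    For concreteness, here's a minimal sketch of what such a generator might look like in PyTorch. The channel counts and layer configuration below are my own assumptions for 64x64 outputs, not necessarily the exact ones used in this assignment:

    import torch.nn as nn

    class Generator(nn.Module):
        """DCGAN-style generator: noise vector -> 3 x 64 x 64 image (sketch)."""
        def __init__(self, noise_dim=100, conv_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                # noise_dim x 1 x 1 -> (conv_dim*8) x 4 x 4
                nn.ConvTranspose2d(noise_dim, conv_dim * 8, 4, 1, 0), nn.BatchNorm2d(conv_dim * 8), nn.ReLU(),
                # -> (conv_dim*4) x 8 x 8
                nn.ConvTranspose2d(conv_dim * 8, conv_dim * 4, 4, 2, 1), nn.BatchNorm2d(conv_dim * 4), nn.ReLU(),
                # -> (conv_dim*2) x 16 x 16
                nn.ConvTranspose2d(conv_dim * 4, conv_dim * 2, 4, 2, 1), nn.BatchNorm2d(conv_dim * 2), nn.ReLU(),
                # -> conv_dim x 32 x 32
                nn.ConvTranspose2d(conv_dim * 2, conv_dim, 4, 2, 1), nn.BatchNorm2d(conv_dim), nn.ReLU(),
                # -> 3 x 64 x 64; tanh squashes pixels to [-1, 1]
                nn.ConvTranspose2d(conv_dim, 3, 4, 2, 1), nn.Tanh(),
            )

        def forward(self, z):
            return self.net(z)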

    As with GANs in general, we begin by sampling random noise which is fed into the generator - giving us our batch of "fake images". Both networks are now ready to "learn".

    The discriminator has two missions: classify all real images as real, and all fake images as fake. The first term below (D's ability to classify real images as real) is zero when D outputs 1 for real images; the second term (D's ability to classify fake images as fake) follows the same logic. With real images $x^{(i)}$ and noise samples $z^{(i)}$, this can be expressed as: $$J^{(D)} = \frac{1}{2m}\sum_{i=1}^{m}\left(D(x^{(i)}) - 1\right)^2 + \frac{1}{2m}\sum_{i=1}^{m}\left(D(G(z^{(i)}))\right)^2$$

    The generator's job is to fool the discriminator. Said differently, G wants D to output 1 for the images it creates - that's exactly when G's loss is zero. This can be concisely expressed as: $$J^{(G)} = \frac{1}{m}\sum_{i=1}^{m}\left(D(G(z^{(i)})) - 1\right)^2$$
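    As a sketch of how these least-squares losses translate to code (assuming G, D, real_images, and noise_dim are already defined; this is not necessarily the assignment's exact training loop):

    import torch

    z = torch.randn(real_images.size(0), noise_dim, 1, 1)
    fake_images = G(z)

    # Discriminator: push D(real) toward 1 and D(fake) toward 0.
    # detach() keeps the generator out of the discriminator update.
    d_loss = 0.5 * ((D(real_images) - 1) ** 2).mean() \
           + 0.5 * (D(fake_images.detach()) ** 2).mean()

    # Generator: push D(fake) toward 1.
    g_loss = ((D(fake_images) - 1) ** 2).mean()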

    The proof is in the padding [I'm sorry]

    Here's the proof of the padding value p used in every layer of DCGAN's discriminator except the last. Since the last layer produces a single output, padding doesn't really make sense there, so it is set to 0. In all other layers, we use a padding value of 1, as shown below:

    Proof
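    In text form, the derivation goes roughly as follows, assuming each such layer uses kernel size $K = 4$ and stride $S = 2$ and is meant to halve the spatial size: the convolution output-size formula is $$W_{out} = \left\lfloor\frac{W_{in} + 2p - K}{S}\right\rfloor + 1$$ and setting $W_{out} = W_{in}/2$ with $K = 4$, $S = 2$ gives $$\frac{W_{in}}{2} = \frac{W_{in} + 2p - 4}{2} + 1 \;\Longrightarrow\; 2p - 2 = 0 \;\Longrightarrow\; p = 1$$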

    Loss Curves


    A few notes on the loss curves:

  • The first set of results shows loss curves for basic data preprocessing, without and with differentiable augmentation. The second set is the same, but with deluxe data preprocessing.
  • Differentiable augmentation (shown in green for both sets of results) clearly performs better: the generator achieves a lower loss, and conversely, the discriminator has a harder time differentiating between reals and fakes.
  • With deluxe preprocessing, performance is better on average (a slightly lower loss for the generator and a slightly higher loss for the discriminator). A sketch of what "deluxe" preprocessing might look like follows this list.
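    For context, here is a plausible sketch of such a "deluxe" pipeline using torchvision transforms; the exact transforms and parameters in the starter code may differ:

    import torchvision.transforms as T

    image_size = 64  # assumed training resolution

    # Hypothetical "deluxe" preprocessing: resize slightly larger than the
    # target size, then random-crop and random-flip for extra data diversity.
    deluxe_transform = T.Compose([
        T.Resize(int(1.1 * image_size)),
        T.RandomCrop(image_size),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
        T.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # scale pixels to [-1, 1]
    ])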
    --data_preprocess=basic [--use_diffaug]

    [Generator loss curves: --basic vs. --basic --use_diffaug]
    [Discriminator loss curves: --basic vs. --basic --use_diffaug]

    --data_preprocess=deluxe [--use_diffaug]

    [Generator loss curves: --deluxe vs. --deluxe --use_diffaug]
    [Discriminator loss curves: --deluxe vs. --deluxe --use_diffaug]

    Generated Samples


    Here, we show samples generated using deluxe data preprocessing, without and with differentiable augmentation. In both cases, there is a clear improvement in quality from iteration 200 to iteration 6400: initially the generations are very blurred, and towards the end of training we see some cats that are very clearly grumpy.

    It is also worth noting that the grumpy cats generated by the DCGAN trained with differentiable augmentation tend to have both eyes more often than those from the DCGAN trained without such augmentation. Simply put, training with differentiable augmentation tends to produce generations of relatively superior quality.
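    Differentiable augmentation works by applying the same differentiable transforms to real and fake images inside the loss computation, so gradients flow through the augmentation to the generator. With the official DiffAugment-pytorch module, the LSGAN losses above would change roughly as follows (a sketch; the starter code may wire this differently):

    from DiffAugment_pytorch import DiffAugment  # official DiffAugment implementation

    policy = 'color,translation,cutout'  # policy string from the DiffAugment paper

    # Augment reals and fakes with the same differentiable ops; gradients
    # flow through the augmentation, so the generator also "sees" it.
    d_loss = 0.5 * ((D(DiffAugment(real_images, policy=policy)) - 1) ** 2).mean() \
           + 0.5 * (D(DiffAugment(fake_images.detach(), policy=policy)) ** 2).mean()
    g_loss = ((D(DiffAugment(fake_images, policy=policy)) - 1) ** 2).mean()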

    --data_preprocess=deluxe

    [Samples at iteration 200]
    [Samples at iteration 6400]

    --data_preprocess=deluxe --use_diffaug

    [Samples at iteration 200]
    [Samples at iteration 6400]

    III. CycleGAN

    Method

    In this section we deal with the task of image-to-image translation. The task is to convert images from a source domain (X) to look like images in a target domain (Y), while preserving the structural properties of X as much as possible. The CycleGAN architecture makes this possible in at least two ways:

  • Skip Connections: The skip connection in the ResNet blocks of the generator (the "transformation" stage) helps preserve some characteristics of the input image in the generated image.
  • The cycle consistency loss: This loss explicitly biases the network towards preserving source-image characteristics by taking the p-norm between an input image and its "cycle"-translated version (e.g., translate Y to X and then back to Y), as formulated below for the Y side (an analogous term covers the X side): $$\frac{1}{m}\sum_{i=1}^{m}\|y^{(i)} - G_{X \rightarrow Y}(G_{Y \rightarrow X}(y^{(i)}))\|_p$$

    Clearly, the above loss implies that we have two generators in CycleGAN - one that translates from X → Y, and another that translates from Y → X (both share the exact same architecture). Additionally, the (patch) discriminator classifies patches instead of the entire image as real or fake - to better model local shapes. We only include the architecture of the generator below, since the discriminator is essentially the DCGAN discriminator without the final layer. A short code sketch of the cycle loss follows the generator diagram.

    Generator
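    To make the cycle loss concrete, here is a minimal sketch in PyTorch, assuming two generators named G_XtoY and G_YtoX (my own names) and the choice p = 1:

    import torch.nn.functional as F

    # Cycle consistency in both directions with p = 1 (L1 norm);
    # G_XtoY, G_YtoX, images_X, and images_Y are assumed to be defined.
    cycle_X = G_YtoX(G_XtoY(images_X))  # X -> Y -> X
    cycle_Y = G_XtoY(G_YtoX(images_Y))  # Y -> X -> Y
    cycle_loss = F.l1_loss(cycle_X, images_X) + F.l1_loss(cycle_Y, images_Y)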

    Early Results

    Here are some of the earlier results of our CycleGAN with the patch discriminator on the first cat dataset (Grumpy and Russian Blue). These are results after 1000 iterations, and we show two sets of results, without and with cycle consistency loss. Observations are stated in the Final Results: Cats section.

    --disc patch --train_iters 1000

    Russian-Blue to Grumpy
    Grumpy to Russian-Blue

    --disc patch --use_cycle_consistency_loss --train_iters 1000

    Russian-Blue to Grumpy
    Grumpy to Russian-Blue

    Final Results: Cats

    Cycle Consistency Comparison

    As in the previous section, the second set of results use cycle consistency loss.

    [Positive] We see that it's better in at least two ways:

  • It is able to preserve more of the features of the original image, which is the whole point of cycle consistency loss. Assuming 1-based indexing, take a close look at (row 2, col 3) and (row 4, col 2) of Russian-Blue to Grumpy. We see that the background color (green and red, respectively) is better preserved in the version that uses cycle consistency loss - which makes sense.
  • It preserves the color scale of the target domain. This is more evident on the right side, where the colors of the generated Russian Blue kitties are closer to the dataset when we use cycle consistency loss. Again, this is probably because the penalty in the loss function biases the generator to preserve the structure of the original image.

    [Negative] Something that is not too great about using cycle consistency loss is the following:

  • In an attempt to preserve the background color (especially when it takes up a significant portion of the source image), the outputs with cycle consistency loss deform the cat's face too much (look at column 7 of rows 1 and 3 of Russian-Blue to Grumpy). The results without cycle consistency loss look better in this regard.

    --disc patch --train_iters 10000

    Russian-Blue to Grumpy
    Grumpy to Russian-Blue

    --disc patch --use_cycle_consistency_loss --train_iters 10000

    Russian-Blue to Grumpy
    Grumpy to Russian-Blue

    DC Discriminator Comparison

    Here we compare the results of image to image translation when using DC discriminator instead of the patch discriminator used in the previous sections.

  • The results are more faded in this case, and the effect is more evident in the Grumpy to Russian-Blue direction. This is probably because the generator is no longer incentivized to preserve local structures, since the discriminator only cares about the entire image.

    --disc dc --use_cycle_consistency_loss --train_iters 10000

    Russian-Blue to Grumpy
    Grumpy to Russian-Blue

    Final Results: Fruits

    Cycle Consistency Comparison

    Following the structure of the cats dataset, we also analyze the results from the fruits dataset.

    [Positive]: It is worth noting that the orange to apple translation has more nuance in the version with cycle consistency loss. Instead of simply converting to a red blob, with cycle consistency we see lighter yellowish shading along with red, making it look more like an apple. This is especially evident when the source orange has been sliced (see top-left image).

    [Negative]: The version with cycle consistency loss, however, converts all white-ish objects to black, which sometimes throws off the aesthetic. The effect is probably related to the choice of p = 1 for the norm in the cycle loss.


    --disc patch --train_iters 10000

    Apple to Orange
    Orange to Apple

    --disc patch --use_cycle_consistency_loss --train_iters 10000

    Apple to Orange
    Orange to Apple

    DC Discriminator Comparison

    Without a patch-based discriminator, the image translations seem more grainy and blurred. This is expected: with a whole-image discriminator, the generator is under less pressure to get local textures right.

    --disc dc --use_cycle_consistency_loss --train_iters 10000

    Apple to Orange
    Orange to Apple

    IV. Bells and Whistles

    Train Diffusion Model


    I trained a diffusion model following this HuggingFace notebook, which introduces their new "diffusers" library - "diffusers" makes it incredibly easy to play around with diffusion models on your own datasets. I specifically used the Denoising Diffusion Probabilistic Model (DDPM) pipeline.

    I used 1000 timesteps for the denoising process, but I've included 10 equally spaced timesteps below - play with the slider to see some 32x32 grumpy cats emerge out of random noise!

    [Slider: denoising timesteps 1-10]
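    For reference, a minimal sketch of the DDPM training step in the notebook's recipe, using the diffusers API (model and data setup are omitted; dataloader and clean_images are my own assumed names):

    import torch
    import torch.nn.functional as F
    from diffusers import UNet2DModel, DDPMScheduler

    model = UNet2DModel(sample_size=32, in_channels=3, out_channels=3)
    noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

    # One training step: noise the clean images to random timesteps,
    # then train the UNet to predict the noise that was added.
    clean_images = next(iter(dataloader))  # assumed: 3x32x32 images in [-1, 1]
    noise = torch.randn_like(clean_images)
    timesteps = torch.randint(0, 1000, (clean_images.shape[0],))
    noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)
    loss = F.mse_loss(model(noisy_images, timesteps).sample, noise)
    loss.backward()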

    Pretrained Diffusion


    Here are some anthropomorphic versions of grumpy and Russian Blue cats. I first optimized my prompts with the prompt optimizer Promptist. I then generated samples using a pre-trained diffusion model from DreamBooth, as shown below.

    A grumpy cat going to work wearing a suit, RPG Reference, Oil Painting, Trending on Artstation, octane render, Insanely Detailed, 8k, HD
    A russian blue cat coding on his laptop, RPG Reference, Oil Painting, Trending on Artstation, octane render, Insanely Detailed, 8k, HD
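    For completeness, here's a minimal sketch of how such sampling might look with diffusers, where the checkpoint name is a placeholder rather than the actual model I used:

    import torch
    from diffusers import StableDiffusionPipeline

    # "some/dreambooth-checkpoint" is a placeholder, not the real model ID.
    pipe = StableDiffusionPipeline.from_pretrained(
        "some/dreambooth-checkpoint", torch_dtype=torch.float16
    ).to("cuda")

    prompt = ("A grumpy cat going to work wearing a suit, RPG Reference, Oil Painting, "
              "Trending on Artstation, octane render, Insanely Detailed, 8k, HD")
    image = pipe(prompt).images[0]  # run the full denoising loop
    image.save("grumpy_suit.png")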

    GAN Meme


    And finally, here's what would happen if you let a GAN make a cat meme.

    Don't worry, you're not tripping.